A METHOD AND SYSTEM FOR PREDICTING AMINO ACID SEQUENCES
COMPATIBLE WITH A SPECIFIED THREE DIMENSIONAL STRUCTURE

FIELD OF THE INVENTION

This invention relates to the field of protein design and more particularly to the
field of inverse-protein folding for de novo protein design. 

PRIOR ART 

The following is a list of prior art which is considered to be pertinent for
describing the state of the art in the field of the invention. Acknowledgement of these
references herein will be made by indicating the number from their list below within
brackets.

1. Desjarlais J.R. & Handel T.M. Protein Science 4:2006-2018 (1995).
2. Lazar G.A. et al. Protein Sci. 6:1167-1178 (1997).
3. Hellinga H.W. et al. J. Mol. Biol. (1991).
4. Hurley H.W. et al. J. Mol. Biol. 224:1143-1154 (1992).
5. Harbury P.B. et al. PNAS USA 92:8408-8412 (1995).
6. Klemba M. et al. Nat. Struct. Biol. 2:368-373 (1995).
7. Nautiyal S. et al. Biochemistry 34:11645-11651 (1995).
8. Betz S.F. et al. Biochemistry 35:6955-6962 (1996).
9. Dahiyat B.I. et al. Protein Science 5:895-903 (1996).
10. Jones D.T. et al. Protein Science 3:567-574 (1994).
11. Kono H. & Doi J. Proteins: Structure, Function and Genetics 19:224-255 (1994).
12. Desmet J. et al. Nature 356:539-542 (1992).
13. Dahiyat B.I. & Mayo S.L. Protein Sci. 5:895-903 (1997).
14. Dahiyat B.I. & Mayo S.L. Proc. Natl. Acad. Sci. USA 94:10172-10177 (1997).
15. Malakauskas S.M. and Mayo S.L. Nat. Struct. Biol. 5:470-475 (1998).

BACKGROUND OF THE INVENTION 

Depending on the primary structure and the environment, proteins fold into a
three-dimensional (3D) structure containing recurring motifs which pack together to
form the 3D structure, the most common motifs observed being the α-helix, β-turn,
and parallel and anti-parallel β-sheets.

The 3D structure of a protein may be characterized as having internal surfaces,
being the areas buried within the structure and thus directed away from the aqueous
environment in which the protein is normally found; external surfaces, being the areas
exposed to the aqueous environment; and intermediate or boundary surfaces. Through
the study of many natural proteins, researchers have discovered that hydrophobic
residues are most frequently found on the internal surface of water soluble protein
molecules while hydrophilic residues are most frequently found on the external protein
surfaces.

It was established that while the biological properties of a protein depend
directly on the protein's 3D conformation, only some of the information in the protein's
sequence is necessary to specify its fold, i.e. a given native structure may be formed
from many different sequences [Lau K.F. and Dill K.A. PNAS USA 87:638-652
(1990)]. The different sequences compatible with a given 3D structure are referred to
as the structure's Sequence Space. The finding that a number of amino acid sequences
may fold into the same basic 3D structure has focused attention on a new field
commonly referred to as "inverse protein folding" or "de novo protein design".
While conventional protein folding methods try to predict the tertiary structure
of a protein from its amino acid sequence, protein design methods look for a
sequence that will stabilize a given fold, by using the same principles.

Reports of experimentally predicted amino acid sequences, which adopt an
intended fold and possess physical properties similar at least in part to those of natural
proteins, are appearing with increasing frequency [Kortemme T. et al. Science
281:253-256 (1998); Kuroda Y. et al. J. Mol. Biol. 236:862-868 (1994); Quinn T.P. et
al. PNAS USA 91:8747-8751 (1994); Fezoui Y. et al. PNAS USA 91:3675-3679 (1994);
Betz S.F. et al. Curr. Opin. Struct. Biol. 5:437-463 (1995); Raleigh D.P. et al. J. Am.
Chem. Soc. 117:7558-7559 (1995); Regan L. & DeGrado W.F. Science 241:976-978
(1988); Hecht M.H. et al. Science 249:884-891; Beauregard M. et al. Protein Eng.
4:745-749 (1991); Kamtekar S. et al. Science 262:1680-1685 (1993)]. These studies
have been predominantly experimental and rely on knowledge of the physical
properties that determine the protein's structure, such as the patterns of hydrophobic
and hydrophilic residues in the sequence.

Several groups have applied experimentally tested, systematic, quantitative
methods to protein design with the goal of developing general design algorithms.
Desjarlais and Handel [1] were the first to experimentally investigate predictions
generated by genetic algorithms (GA). They have developed ROC ("Repacking of
Cores"), a computational program that attempts to find novel core sequences given the
backbone structure of the protein of interest. In different, however related, work, a
modification of the ROC was used on the secondary structure of the αβ protein
ubiquitin [2]. The program used a genetic algorithm to optimize the search for
alternative core structures for a given protein. Other experimentally tested methods
applied with respect to protein design are described elsewhere [3-11]. Thus, in some
cases, uniquely folded and even functional globular proteins may be obtained using
highly simplified minimally designed cores. The algorithms consider the spatial
positioning and steric complement of side chains by explicitly modeling the atoms of
sequences under consideration. However, despite the success of these studies, a full
predictive understanding of hydrophobic core packing in proteins has not yet been
realized, and de novo design of stable and unique proteins remains a challenging
problem.

A major breakthrough was achieved by the Dead-End Elimination (DEE)
algorithm by Desmet et al. [12], which was originally developed for homology modeling.
DEE finds and eliminates rotamers that are mathematically provable to be inconsistent
(or dead ending) with the global minimum energy solution of the system.

Dahiyat and Mayo [13, 14] further adapted the algorithm by Desmet for the explicit
exploration of sequence space using semi-empirical potential functions and
stereochemical constraints, which were intended to capture most of the known
contributions to protein stability. In their design strategy they succeeded in expanding
the range of computational protein design to residues of all parts of the protein: the
buried core, the solvent-exposed surface, and the boundary between core and surface.

SUMMARY OF THE INVENTION 

In accordance with a first of its aspects, the present invention relates to a
computer-implemented method for predicting at least one amino acid sequence
compatible with a predefined three-dimensional (3D) structure, which method comprises
the steps of:

a) providing a coordinate set representing the backbone of said 3D structure;

b) constructing a reduced virtual representation for the 3D structure provided in
step (a);

c) determining for each position along the virtual structure representation provided
in step (b) its solvent accessibility;

d) constructing an initial amino acid sequence by assigning each position along the
sequence an amino acid residue selected randomly from a predefined group of
amino acids having a solvent accessibility (SA) compatible with the solvent
accessibility determined for each position;

e) randomly selecting one or more positions along the sequence provided in step
(d) and applying on each position a Monte-Carlo simulation in sequence space
and rotamer space, said simulation comprising one or more scoring function
calculating steps which include:

i) randomly selecting one or more amino acid residues of the same solvent
accessibility as that defined for said position to provide a mutation;

ii) calculating an energy scoring function for each possible rotamer of each
amino acid residue provided in step (i) based on their said reduced virtual
representation;

iii) selecting the lowest scoring rotamer or, when more than one amino acid is
manipulated simultaneously, selecting the lowest scoring rotamer
combination;

iv) determining whether to accept or reject the mutation with the rotamer or
rotamer combination selected in step (iii), by applying, for example, the
Metropolis algorithm; and

v) assigning the amino acid residue or residues and their respective selected
rotamer or rotamer combinations selected in step (iii) to said position/s and
moving to another position along the sequence;

said simulation steps are repeated until for each position along said sequence,
the residue and residue's rotamer with the lowest score is selected, to obtain a virtually
represented amino acid sequence with the lowest total score;

f) expanding the reduced representation of the amino acid sequence obtained in
step (e) to its corresponding all-atom sequence representation thereby obtaining
an amino acid sequence compatible with said predefined 3D structure; and

g) optionally, creating a computer output of the expanded all-atom representation
of the amino acid sequence obtained in step (f).

According to a second aspect, the invention provides amino acid sequences
which fold into predefined 3D structures, the amino acid sequences being obtained by
the method of the present invention.

Furthermore, the invention provides, in accordance with another of its aspects, a
computer-based system for predicting an amino acid sequence compatible with a
predefined 3D structure comprising a computer device equipped with: (a) input
apparatus, such as a keyboard, for specifying said 3D structure; (b) a first memory for
storing the specified 3D structure; (c) a second memory having stored thereon an
application program which, when running, provides at least one amino acid sequence
compatible with the specified 3D structure; (d) a third memory for storing the at least
one amino acid sequence obtained; (e) a processor coupled to said input means, and to
said first, second and third memories for representation of said amino acid sequence;
and (f) optionally, a display unit coupled to said processing means for displaying the
amino acid sequence.

The specified 3D structure may be obtained from a data bank accessible through
the network or available on diskette, CD or tape which is then downloaded onto the
first memory module. Thus, the term "input apparatus" signifies also any suitable
means for connecting to a network and retrieving from available databanks accessible
thereby the desired 3D structure. Furthermore, input apparatus also refers to any
apparatus enabling retrieving such sequences from computer readable mediums, e.g.
diskettes, CDs, tapes etc.

The processor may be any computer device stored with an application utility,
which when running on the computer device, enables the processing of the stored data
so as to provide an amino acid sequence which substantially folds into a desired 3D
structure, i.e. that specified in step (a) of the method of the invention. Such a computer
device includes, inter alia, a personal computer (PC, either Windows or Linux OS),
workstation computers (UNIX), a computer-cluster or super-computers.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to understand the invention and to see how it may be carried out in
practice, some non-limiting examples will now be described, with reference to the
accompanying drawings, in which:

Fig. 1 shows the energy profile obtained for the Zif268 backbone by the system
of the invention, during 10^ iterations, at three temperatures: 1000 K (continuous line),
500 K (broken line) and 100 K (dotted line). The temperature remained constant during
the simulation. The energy of the initial random sequence, before the simulation was
initiated, was +204 kcal/mol.

Fig. 2 shows the energy profile obtained for the Zif268 backbone by the system
of the invention during 10^ iterations, at three maximal temperatures: 1000 K
(continuous line), 500 K (broken line) and 100 K (dotted line) using an annealing
temperature profile, with periodicity of 500 Monte Carlo steps during which the
temperature is gradually decreased from its initial value to zero, and then set up again
to its initial value for another cycle. The energy of the initial random sequence, before
the simulation started, was +204 kcal/mol.

Fig. 3 shows the energies of the 20 lowest sequences generated by the algorithm
at different simulation lengths and different temperatures, using an annealing
temperature profile with periodicity of 500 Monte Carlo steps.

Fig. 4A-4C shows the 3D crystallographic structure of Zif268
(Fig. 4A) compared with the 3D structure of the designed proteins A and B (Figs. 4B
and 4C, respectively).

Fig. 5A-5C shows the 3D crystallographic structure of Zif268 (Fig. 5A)
compared with the 3D structure of the designed proteins A and B (Figs.
5B and 5C, respectively), after minimization of their side chains, displayed by spheres
sized to the van der Waals radii of the atoms (not including hydrogens).

Fig. 6 shows a diagram of Gβ1 solvent accessibility, according to the present
invention's methodology (black columns) and according to D&M (gray columns).

Fig. 7A-7B shows the 3D structure of Gβ1 overlaid on that of the designed
sequence C from two different angles (Figs. 7A and 7B).

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates in general to a method of predicting one or more
amino acid sequences compatible with a predefined 3D structure. The predefined 3D
structure may be that of a native protein, polypeptide, a biologically functional
derivative or fraction of the native protein or polypeptide or any other biologically
functional polymer for which the determination of its lowest free-energy structure is
desirable.

In the current disclosure, above and below, the terms "amino acid sequence",
"primary sequence" and other similar terms or derivations thereof may be used
interchangeably. These terms, as used herein, refer to an amino acid sequence of a
protein or polypeptide. The primary structure of a protein or polypeptide is the amino
acid sequence wherein the locations of disulfide bridges, if any exist, are indicated. The
primary structure is thus a complete description of the covalent connections within the
polymer.

The term "amino acid" as used herein above and below means any organic
compound possessing one or more amino groups and one or more carboxyl groups.
Such amino acids may be naturally occurring L-amino acids, their corresponding
D-isomers, synthetic amino acids, or any other variations of the same. Within this
context, the term variant should be understood as including all possible modifications
of the naturally occurring or synthetic amino acids including deletions, insertions,
substitutions of group/s therein. By the term "amino acid residue" it should be
understood an amino acid, as defined above, which forms part of a chain, the chain
consisting of two or more amino acid units.

The coordinate set, including the dihedral angles and specific bonds within the
predefined 3D structure, may be obtained from any suitable databank known to those
versed in the art, such as the Protein Data Bank (PDB, supported by the RCSB
consortium) and is preferably provided in a computer readable form to enable its easy
input into the system of the invention. Alternatively, the 3D structure may be defined at
will, without relying on any known 3D structure of any specific protein. Such novel 3D
structures will agree with the general structure constraints of polypeptides, such as
backbone geometries, as known to those versed in the art.

According to the method of the invention, a reduced virtual representation is first
constructed for the predefined 3D structure. The reduced representation may be obtained
by the methodology originally developed by Herzyk and Hubbard for use with dynamic
simulated annealing [Herzyk P. and Hubbard R.E. Proteins 17:310-324 (1993)].
According to this methodology, the amino acids are represented by virtual spherical
atoms, wherein the main chain of the protein, polypeptide or any other suitable polymer
is represented by one virtual atom per residue located at the Cα position and the side
chains are represented by one or more additional virtual atoms. The number of
additional virtual atoms depends on the size and chemical composition of the specific
side chain.

Typically, one additional virtual atom will represent amino acid residues having
only a β side chain heavy atom or β and γ side chain heavy atoms, e.g. serine (Ser, S),
threonine (Thr, T), alanine (Ala, A), valine (Val, V), cysteine (Cys, C). Proline will
also form part of this group as its Cδ heavy atom is very close to its Cα and Cβ
atoms. Two additional virtual atoms will represent amino acid residues having γ and
δ side chain heavy atoms, β being represented by one virtual atom and γ and δ
together by another virtual atom. It should be noted that residues represented with two
additional side chain virtual atoms exhibit rotational flexibility around the Cβ-Cγ bond,
e.g. histidine (His, H), aspartic acid (Asp, D), asparagine (Asn, N), tyrosine (Tyr, Y),
leucine (Leu, L), isoleucine (Ile, I), phenylalanine (Phe, F) and methionine (Met, M).
Three additional virtual atoms will represent amino acids such as lysine (Lys, K),
arginine (Arg, R), glutamic acid (Glu, E), glutamine (Gln, Q) and tryptophan (Trp,
W). Evidently, amino acids other than the naturally occurring amino acids (e.g.
chemical modifications or synthetic variations thereof) may be represented by virtual
atoms in a similar manner.
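
By way of non-limiting illustration, the grouping of residues by number of side chain
virtual atoms may be held in a simple lookup table. The following is a minimal sketch
only; the names SIDE_CHAIN_VIRTUAL_ATOMS, virtual_atom_count and has_rotamer_freedom are
hypothetical, and the entry for glycine (no side chain virtual atom) is an assumption,
since glycine is not listed in the grouping above.

```python
# Additional (side chain) virtual atoms per residue, following the grouping in the text.
SIDE_CHAIN_VIRTUAL_ATOMS = {
    **dict.fromkeys(["SER", "THR", "ALA", "VAL", "CYS", "PRO"], 1),
    **dict.fromkeys(["HIS", "ASP", "ASN", "TYR", "LEU", "ILE", "PHE", "MET"], 2),
    **dict.fromkeys(["LYS", "ARG", "GLU", "GLN", "TRP"], 3),
    "GLY": 0,  # assumption: glycine is not listed in the text
}


def virtual_atom_count(residue: str) -> int:
    """One C-alpha virtual atom plus the side chain virtual atoms."""
    return 1 + SIDE_CHAIN_VIRTUAL_ATOMS[residue.upper()]


def has_rotamer_freedom(residue: str) -> bool:
    """Residues with more than one side chain virtual atom need a rotamer search."""
    return SIDE_CHAIN_VIRTUAL_ATOMS[residue.upper()] > 1
```

In such a table only residues with two or three side chain virtual atoms would trigger a
search in rotamer space.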

After constructing the reduced representation for the 3D structure provided, the
extent of solvent accessibility at each position along the 3D structure is determined. In
principle, solvent accessibility (SA) is a feature assigned for each position along the
chain of the folded protein or polypeptide. Each position is categorized as being either
buried within the 3D structure (in an internal surface), exposed (part of an external
surface) or within a boundary surface (intermediate position).

According to one preferred embodiment of the invention, the SA is determined
by surrounding the reduced representation of the protein with a grid and calculating the
number of grid points that fall into the intersection volume of the volume of every
virtual atom and the volume of its neighbor virtual atom (the volume determined
according to the adequate van der Waals radius [Bernstein F.C. et al. J. Mol. Biol.
112:535-542 (1977)]). However, it should be clear that other ways of determining an
amino acid's SA can be employed as may be known to the man versed in the art.
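
The grid-based count described above may be sketched as follows. This is an
illustration under stated assumptions and not the implementation of the invention: the
grid spacing, the function names overlap_grid_points and burial_score, and the choice to
sum the overlap over all other virtual atoms are hypothetical. The thresholds that would
turn such a count into a buried, exposed or intermediate label are not given in the text
and are therefore not shown.

```python
import itertools
import math


def overlap_grid_points(center_a, radius_a, center_b, radius_b, spacing=0.5):
    """Count grid points that lie inside both spheres (their intersection volume)."""
    lo = [c - radius_a for c in center_a]           # corner of the box around sphere a
    steps = int(math.ceil(2 * radius_a / spacing)) + 1
    count = 0
    for i, j, k in itertools.product(range(steps), repeat=3):
        p = (lo[0] + i * spacing, lo[1] + j * spacing, lo[2] + k * spacing)
        if math.dist(p, center_a) <= radius_a and math.dist(p, center_b) <= radius_b:
            count += 1
    return count


def burial_score(index, centers, radii, spacing=0.5):
    """Sum of overlap counts between one virtual atom and every other virtual atom."""
    return sum(
        overlap_grid_points(centers[index], radii[index], centers[j], radii[j], spacing)
        for j in range(len(centers))
        if j != index
    )
```

A larger count indicates that the virtual atom is more crowded by its neighbors, which is
the quantity the embodiment uses as a measure of burial.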

Each type of position, i.e. buried, exposed or intermediate, may be occupied by
several amino acid residues. Hydrophobic amino acids, being able to form a
hydrophobic core, are assigned to the buried positions of the 3D structure, while
hydrophilic amino acids are assigned for the solvent-exposed positions. Boundary
positions, between those two environments, can be occupied by both types of amino
acids.

According to one particular embodiment of the invention, the buried positions
may be occupied by amino acids selected from the group consisting of Ala, Tyr, Trp,
Val, Leu, Ile, Phe, Met, Cys, Pro, Gly and variants thereof, all of which being
hydrophobic in nature.

The exposed positions may be occupied by amino acid residues selected from the
group consisting of Lys, Arg, His, Glu, Asp, Gln, Asn, Ser, Thr and variants thereof, all
of which being hydrophilic in nature.

The positions having assigned an intermediate level of SA may be occupied by
all types of amino acids, particularly those which serve in nature as building blocks for
proteins, i.e. Pro, Lys, Arg, His, Glu, Asp, Gln, Asn, Ser, Thr, Gly, Ala, Tyr, Trp, Val,
Leu, Ile, Phe, Met, Cys and variants thereof.

It should be noted, however, that the above classification of the buried,
intermediate and exposed residues is only one example of classification and these
groups may be changed.

Special assignment of patterns may be set for particular positions in the
protein's 3D structure that deviate from the general assignment based on SA; such
assignment may be introduced, for example, to preserve buried salt bridges.

After assigning each position with its characteristic SA: (1) an amino acid for
every Cα position is selected randomly, taking into account the solvent accessibility of
that position (buried, exposed or intermediate). Alternatively, this selection may be
applied only to a sub-set of the polypeptide's amino acid residues, leaving the other
positions fixed throughout the design process; (2) the appropriate bonds and angles of
the protein's reduced representation are assigned based on the coordinate set provided;
and (3) one of the physically permissible rotamers that characterizes each of the amino
acids is assigned.
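
The random construction of the initial sequence from the SA classes may be sketched as
follows. The residue groups are those of the particular embodiment described above; the
names ALLOWED and initial_sequence, and the `fixed` argument used to hold positions
constant, are hypothetical and serve for illustration only.

```python
import random

BURIED = ["ALA", "TYR", "TRP", "VAL", "LEU", "ILE", "PHE", "MET", "CYS", "PRO", "GLY"]
EXPOSED = ["LYS", "ARG", "HIS", "GLU", "ASP", "GLN", "ASN", "SER", "THR"]
ALLOWED = {
    "buried": BURIED,
    "exposed": EXPOSED,
    "intermediate": BURIED + EXPOSED,  # all twenty residue types
}


def initial_sequence(sa_classes, rng=None, fixed=None):
    """Pick a random residue per position from the group matching its SA class.

    `fixed` maps position index -> residue for positions excluded from the design.
    """
    rng = rng or random.Random()
    fixed = fixed or {}
    return [fixed.get(i) or rng.choice(ALLOWED[c]) for i, c in enumerate(sa_classes)]
```

For example, `initial_sequence(["buried", "exposed", "intermediate"], fixed={1: "ASP"})`
would keep position 1 as Asp and draw the other two positions at random.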

A scoring function is applied to evaluate the effect of changing the amino acid
sequence and residue rotamer for a given 3D backbone structure. The scoring function
used according to the present invention has two main contributions: a residue-residue
interaction term and a residue's secondary-structure propensity term. The interaction
part of the scoring function is based in part on the Lennard-Jones like potential function,
multiplied by the effective attractive inter-residue contact energies. The energy is
summed over all possible pairs of residues in the protein, where the energy between
each pair of residues is a sum of the interaction energies between all possible pairs of
virtual atoms except for the energy between two neighboring Cα virtual atoms that do
not contribute to the interaction. Since each side chain is represented by one or more
virtual atoms, the energy contribution of each residue-residue pair is divided among the
virtual atoms such that the sum over energy contributions of the virtual atom pairs (one
from each residue) equals the effective residue-residue interaction. The Lennard-Jones
potential function is modified to make the effect of repulsion smaller (because the
virtual atoms are 'softer' than real atoms).

The effective contact energies between two amino acids may be, for example,
those calculated by Miyazawa and Jernigan [Miyazawa S. and Jernigan R.L. J. Mol.
Biol. 256:623-644 (1996)]. The basic assumption on which the contact energies are
calculated according to this model is that the average characteristics of residue-residue
contacts, observed in a large number of crystal structures of globular proteins,
represent the actual intrinsic inter-residue close contacts of protein structures.
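
The residue-residue interaction term may be illustrated by the following sketch. The
exact functional form is not given in the text, so everything specific here is an
assumption made for illustration: the 8-4 exponents standing in for a "softened"
Lennard-Jones shape, the equal division of the contact energy among virtual atom pairs,
and the names soft_lj, residue_pair_energy, well_depth and r0. The parameter well_depth
stands for the magnitude of an effective contact energy such as those tabulated by
Miyazawa and Jernigan.

```python
import math


def soft_lj(r, r0, rep_exp=8, att_exp=4):
    """Lennard-Jones like shape with a minimum of -1 at r = r0.

    Exponents below the usual 12-6 make the repulsive wall less steep (assumed form).
    """
    x = r0 / r
    n, m = rep_exp, att_exp
    return (m / (n - m)) * x**n - (n / (n - m)) * x**m


def residue_pair_energy(atoms_i, atoms_j, well_depth, r0, neighbors=False):
    """Sum the softened potential over all virtual atom pairs of two residues.

    The contact energy is shared equally among atom pairs (assumption). Index 0 of each
    atom list is the C-alpha atom; for sequence neighbors that pair is skipped.
    """
    share = well_depth / (len(atoms_i) * len(atoms_j))
    total = 0.0
    for a, pos_a in enumerate(atoms_i):
        for b, pos_b in enumerate(atoms_j):
            if neighbors and a == 0 and b == 0:
                continue  # neighboring C-alpha atoms do not contribute
            total += share * soft_lj(math.dist(pos_a, pos_b), r0)
    return total
```

With every atom pair at its preferred distance r0, the pair energies sum to -well_depth,
which mirrors the requirement that the atom-pair contributions add up to the effective
residue-residue interaction.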

Secondary structure propensities are also included in the scoring function. To
this end, the total energy score of the protein is calculated by adding a residue specific
"potential" for α-helical and β-sheet states. These terms may be, for example, those
calculated by Bahar et al. [Bahar I. et al. Proteins 29:292-308 (1997)]. These so-called
potentials are added only if the residue is situated in an α-helical or a β-sheet region of
the 3D backbone template, according to the secondary structure of the designed protein
or polypeptide.

The scoring function is applied as part of a Monte Carlo simulation which
combines a search in the sequence space for amino acid residues and in the specific
rotamer space of each residue. This process provides the system with the optimal
sequence for a given backbone. The term "optimal sequence" refers to an amino acid
sequence compatible with the predefined 3D structure and having the lowest total
score.

The term "Sequence space" refers to the total number of possible different
sequences for a given number of different residues and a given number of residues in the
protein, polypeptide or any other appropriate polymer, e.g. for a protein of 100 residues,
composed of 20 different amino acids, the sequence space will contain 20^100 possible
sequences. The term "Rotamer space" refers to the total number of physically
permissible conformations for a residue in a given amino acid sequence.
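
The size of the sequence space, and the reduction obtained when each position is
restricted to the residue group of its SA class, can be computed directly. The sketch
below is illustrative; the function names are hypothetical and the group sizes (11
buried, 9 exposed, 20 intermediate) are taken from the particular embodiment described
above.

```python
import math

CLASS_SIZES = {"buried": 11, "exposed": 9, "intermediate": 20}


def sequence_space_size(n_positions, n_types=20):
    """Unrestricted sequence space: n_types ** n_positions."""
    return n_types ** n_positions


def restricted_space_size(sa_classes, class_sizes=CLASS_SIZES):
    """Sequence space when each position draws only from its SA class group."""
    return math.prod(class_sizes[c] for c in sa_classes)
```

For a hypothetical 100-residue protein with 40 buried, 40 exposed and 20 intermediate
positions, the restricted space would hold 11^40 x 9^40 x 20^20 sequences instead of
20^100.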

The advantage of the combined reduced representation of the side chains and the
grouping of amino acids and structure sites according to the solvent accessibility lies in
the high efficiency of searching through both sequence space and rotamer space. The
combined simplifications dramatically reduce the search space, while retaining a
physically reasonable representation that can accurately account for rotamer flexibility.

The search in sequence space begins when up to three positions along the protein
are simultaneously randomly selected and replaced with different amino acids (each
replacement referred to as a 'mutant' in that specific 'trial configuration'). The replacing
amino acids are selected randomly from the group of amino acids having the same
characteristics (buried, exposed or intermediate) as defined for the specific replaced
positions to form the 'mutant'. If any of the new mutation residues has more than one
virtual side chain atom, a search in rotamer space begins by calculating the total energy
score of the new sequence for each and every allowed rotamer or rotamer combination
of the mutated amino acids (not all rotamers are allowed, as described by Ponder and
Richards [Ponder J.W. and Richards F.M. J. Mol. Biol. 193(4):775-791 (1987)]). The
energy score difference ΔE between the lowest energy score of the trial configuration,
being the lowest energy score among all allowed rotamers of the new mutant (or mutants,
if more than one amino acid is replaced), and the energy of the last accepted
configuration is calculated. The Metropolis algorithm [Metropolis N. & Ulam S. J. Am.
Stat. Ass. 44:335-341 (1949)] is used to determine whether the new trial sequence is
accepted or rejected. If ΔE is negative, the mutation is accepted with the best rotamers;
otherwise, the trial configuration is accepted at a probability determined according to the
Boltzmann distribution exp(-ΔE/kT) (T being either a fixed or a varying annealing
temperature as will be described hereinbelow).
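
The acceptance rule just described may be sketched as follows. This is a minimal
illustration: the function name metropolis_accept is hypothetical, and expressing the
score in kcal/mol so that the Boltzmann constant K_B can be used in kcal/(mol K) is an
assumption about units. A temperature of zero is treated as accepting only downhill
moves.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); assumes scores in kcal/mol


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    if delta_e < 0:
        return True
    if temperature <= 0:
        return False  # no thermal energy available for uphill moves
    return rng.random() < math.exp(-delta_e / (K_B * temperature))
```

At 1000 K an uphill step of 2 kcal/mol would be accepted with a probability of roughly
exp(-1), while at 100 K the same step would almost always be rejected.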

The search continues through a large number of trials (steps) in order to allow the
score to decrease and converge. This number depends on the size of the protein and on
the number of residues that are allowed to be mutated (in case the design is just of a
certain part of the protein).
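
Where a varying annealing temperature is used, a periodic profile of the kind described
for Fig. 2 (periodicity of 500 Monte Carlo steps, the temperature decreasing from its
initial value to zero and then being reset) may be sketched as below. The linear shape of
the decrease is an assumption, as the text states only that the temperature is gradually
decreased; the function name is hypothetical.

```python
def annealing_temperature(step, t_max, period=500):
    """Sawtooth profile: t_max at the start of each cycle, zero at its last step.

    The decrease within a cycle is linear (assumption). Requires period > 1.
    """
    phase = step % period
    return t_max * (1.0 - phase / (period - 1))
```

The value returned for a given step would be passed as the temperature to the acceptance
test of that step.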

It is possible to perform multiple mutations in each simulation step, with
adjustable probabilities to determine whether one or more mutations take place at any
given trial step. If two or more mutations take place simultaneously, the minimal energy
score among all rotamer combinations of those mutated residues in the new sequence that
have side chains of two or more virtual atoms is searched.
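
The search over rotamer combinations for simultaneously mutated positions may be
sketched as an exhaustive enumeration, which is feasible because at most a few positions
are mutated per step. The function name and the `energy_fn` callback, which stands for
the total energy score of the trial sequence with the given rotamers, are hypothetical.

```python
import itertools


def best_rotamer_combination(rotamers_per_position, energy_fn):
    """Return (combination, energy) with the lowest energy over all combinations.

    rotamers_per_position: one list of allowed rotamers per mutated position.
    energy_fn: maps a tuple of rotamers (one per position) to a total energy score.
    """
    best_combo, best_energy = None, float("inf")
    for combo in itertools.product(*rotamers_per_position):
        energy = energy_fn(combo)
        if energy < best_energy:
            best_combo, best_energy = combo, energy
    return best_combo, best_energy
```

The energy returned here would be the score of the trial configuration that enters the
calculation of ΔE.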

Throughout the simulation, the lowest "scored" sequences are selected. The final
optimal sequence is the one associated with the lowest total energy score found during
the optimization process. The resulting sequence is then expanded to its corresponding
3D all-atom representation (as opposed to the virtual representation). This all-atom
representation may then be either saved in a computer readable form or extracted in the
form of a computer output. Also collected are additional low energy score sequences, in
order to enable analyzing relative consistency patterns of residues in a given position.
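
Collecting the lowest scoring sequences during the simulation may be sketched with a
bounded heap. The class name LowestScoring is hypothetical, and the default capacity of
20 merely follows the number of sequences shown in Fig. 3.

```python
import heapq


class LowestScoring:
    """Keep the `capacity` distinct sequences with the lowest energy scores seen."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._heap = []      # entries are (-energy, sequence); root is the worst kept
        self._seen = set()

    def offer(self, energy, sequence):
        key = tuple(sequence)
        if key in self._seen:
            return  # already stored
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (-energy, key))
            self._seen.add(key)
        elif -energy > self._heap[0][0]:  # better than the worst kept sequence
            _, dropped = heapq.heapreplace(self._heap, (-energy, key))
            self._seen.discard(dropped)
            self._seen.add(key)

    def best(self):
        """Stored sequences as (energy, sequence) pairs, lowest energy first."""
        return sorted((-neg, list(seq)) for neg, seq in self._heap)
```

The first entry of `best()` corresponds to the optimal sequence, and the remaining entries
can be compared position by position to look for consistently selected residues.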

To evaluate whether the sequence obtained is indeed compatible with the
predefined 3D structure, the all-atom 3D model that is constructed from the novel
sequence can be analyzed. There are several methods for determining whether a designed
amino acid sequence indeed folds into a predefined 3D structure. One way of performing
such an evaluation is to compare the structure of the designed protein (after standard
all-atom minimization of its side chains) with the structure of the model after molecular
dynamics simulation in water, or by comparing the molecular mechanics energy of the
wild-type protein, from which the 3D structure used was taken, with that of the designed
protein, after molecular dynamics of the latter. Molecular dynamics programs, such as
CHARMM [Brooks B.R. et al. J. Comp. Chem. 4:187-219 (1983)], may be utilized for
this purpose, as illustrated in the following Examples.

The amino acid sequence designed by the method of the invention is a de novo
sequence and preferably a sequence which under physiological conditions folds
substantially into the desired 3D structure. More preferably, the amino acid sequences
obtained are biologically functional.

The sequence obtained may be used for various applications. According to one
preferred embodiment, the designed amino acid sequence is chemically synthesized by
procedures known in the art.

In a further preferred embodiment, the novel amino acid sequence is used to
create a nucleic acid sequence, such as DNA, which encodes the optimal sequence. A
man versed in the art would know, based on the existing technologies, how to deduce at
least one nucleic acid which will encode the amino acid sequence designed. The
nucleic acid sequence obtained may then be cloned into a host cell and expressed. The
choice of codons, suitable expression vectors and suitable host cells may vary
depending on a number of factors, and can be easily optimized as needed.
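
Deducing one nucleic acid sequence that encodes a designed amino acid sequence may be
sketched as a lookup in the standard genetic code. The particular codon chosen for each
residue below is arbitrary and made for illustration only; no codon optimization for any
host is implied, and the names CODON and reverse_translate are hypothetical.

```python
# One valid codon of the standard genetic code per amino acid (arbitrary choice).
CODON = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCT",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAT", "V": "GTT",
}


def reverse_translate(protein, add_stop=False):
    """Return a DNA sequence encoding `protein` (one-letter amino acid codes)."""
    dna = "".join(CODON[aa] for aa in protein.upper())
    return dna + "TAA" if add_stop else dna
```

In practice the codon for each residue would be selected according to the expression
host, as noted above.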

Once made, the novel amino acid sequence may be experimentally evaluated
and tested for structure, function and stability, as required. This will be performed as is
known in the art and will depend in part on the original protein from which the
sequence's backbone structure was taken. Preferably, the designed protein will be more
stable than the known protein used as the starting point although, at times, if some
constraints are placed on the method disclosed herein, the designed sequence may be
less stable. For example, it is possible to fix certain residues for altered biological
activity and find the most stable sequence, but it may still be less stable than the wild
type protein. Stable in this context includes, but is not limited thereto, thermal stability,
i.e. an increase in the temperature at which reversible or irreversible denaturing starts
to occur; proteolytic stability, i.e. decrease in the amount of protein which is
irreversibly cleaved in the presence of a particular protease (including autolysis);
stability to alteration in pH or oxidative conditions; chelator stability; stability to metal
ions; stability to solvents such as organic solvents, surfactants, formulation chemicals,
etc.

The proteins of the invention, and naturally, the nucleic acid deduced therefrom,
may be used in a variety of applications, ranging from industrial to pharmacological
uses, depending on the protein. Examples of the different uses are in biotechnology
manufacturing of therapeutic peptides and proteins, in gene therapy, design of
modified therapeutic peptides and proteins as pharmaceuticals, etc.

Another application of the invention disclosed herein may be the generation of a
library of small stable protein elements that can be later assembled in various ways to
design a sequence for a novel larger protein with a desired 3D structure. Yet further,
the method of the present invention may be applicable for optimizing the novel larger
protein obtained, thus ensuring that the peptides from which it was constructed indeed
fit the structure.

In view of the above, the invention further provides amino acid sequences
substantially compatible with a specified 3D structure, the amino acid sequences being
obtained by the method of the present invention.

Yet further, in accordance with another of its aspects, there is provided a
computer-based system for predicting an amino acid sequence compatible with a
specified 3D structure, the system comprising the constituents as defined hereinbefore
and after.

The input apparatus, such as a keyboard, employed by the system of the
invention, is used for entering a selected set of coordinates representing the
predefined 3D structure and other data such as scoring function and optimization
process parameters. The first and third memory means, being preferably a RAM
(random access memory), are used for storing the initial and final data while the second
memory means, being preferably a ROM (read-only memory), is used to store the
program of the method of the present invention. Further, the system comprises a
microprocessor for performing, under control of the stored program, the steps of
processing the entered data and displaying via a display unit or printer the novel amino
acid sequence.

A user enters the coordinate set for the predefined 3D structure from an
optional, auxiliary storage unit. In response to entry of the coordinate set, the system
inputs the data for processing, stores the data in memory, then processes it as described.
The data provided regarding the 3D structures is typically retrieved from existing files
known to those versed in the art and available to them.

SPECIFIC EXAMPLES

The present invention is defined by the claims, the contents of which are to be
read as included within the disclosure of the specification, and will now be described by
way of example with reference to the accompanying Figures.

GENERAL

CHARMM minimization and molecular dynamics

The comparison between the all-atom 3D structure of the designed protein (after
minimization of its side chains) with its structure after molecular dynamics simulation is
carried out in the following specific Examples using the CHARMM molecular dynamics
program [version 29, Brooks B.R. et al. (1983) ibid.]. Further, the comparison between
the averaged energy of the designed protein after dynamics with the energy of the
native protein is carried out using the CHARMM forcefield [Mackerell A.D. et al. J. Phys.
Chem. 102:3586-3616 (1998)]. The minimization and the molecular dynamics are
performed when the protein is embedded in a water sphere. For native proteins, the
coordinates are based on the information provided from PDB. The conformation of the
designed protein is composed of the backbone conformation of the native protein and
the side chain conformations of the new residues, according to the best rotamers
chosen by the method of the invention.

CHARMM applies two minimization algorithms to the protein's side chains,
Steepest Descent (SD) and Adopted Basis Newton Raphson (ABNR). After the
minimization of the side chains and of the water surrounding the protein, CHARMM
performs molecular dynamics of the protein.

Example 1 - Zif268 as a target fold

In order to examine the method according to the invention, the ββα motif
typified by the zinc finger DNA binding module in the zinc finger protein Zif268 was
used. Zif268 is a well recognized protein. This protein is small enough to be both
computationally and experimentally tractable, yet large enough to form an
independently folded structure in the absence of disulfide bonds or metal binding.
Although this motif consists of fewer than 30 residues, it does contain sheet, helix and
turn structures. By the method and system of the invention the entire amino acid
sequence was computed: the buried core, the solvent exposed surface and the boundary
between core and surface, except for Gly27, which was not mutated during the
simulation. The input coordinates are those of residues 33-60 of the native protein,
obtained from the X-ray structure coordinates of the Zif268 immediate early gene (Krox-24)
complex with an 11 base pair DNA fragment (Protein Data Bank (PDB) code: 1ZAA),
as determined at 2.1 Å resolution by Pavletich and Pabo [Pavletich & Pabo, Science
252:809-817 (1991)]. Recently, this protein was also analyzed by Dahiyat & Mayo.

1.1 Solvent accessibility of Zif268

Solvent accessibility (SA) was determined based on the coordinates of the
native protein, and each residue along the chain was assigned a level of SA: buried,
exposed or intermediate. Table 1 presents a comparison between the SA obtained
according to one embodiment of the present invention and those obtained by
Dahiyat and Mayo.



Table 1: Solvent Accessibility of Zif268

Secondary structure containing Extended sheet (E), Turn (T) or Helix (H);
Zif268 wild-type sequence written in one letter code;
Solvent accessibility as determined by the invention, categorized as buried (b), exposed (e) or
intermediate (i);
Solvent accessibility as determined by Dahiyat and Mayo.

As may be seen from Table 1, there are only two differences in the Zif268 solvent
accessibility obtained by one embodiment of the present invention and by D&M, the
latter employing the Connolly algorithm [Connolly M. Science 221:709-713 (1983)]
with subsequent manual changes in the SA assignment of positions 1, 17 and 23 from the
boundary class to the exposed class. Position 4 is classified as an intermediate residue
by the present invention's algorithm and as an exposed residue in D&M's work, and
position 7 is classified as an exposed residue by the present invention's algorithm and
as an intermediate in D&M's work.

1.2 The ββα motif optimization process of the energy score profile

A Monte Carlo (MC) search was conducted as described hereinbefore and the
profile of the score as a function of MC trials was calculated. The results, which
depended on the temperature of the system, T, are presented in Figures 1 and 2.
Figure 1 presents the energy profile at three constant temperature parameters, 100K,
500K and 1000K. Figure 2 presents the energy profile using an annealing profile of the
temperature parameters. The maximal temperatures were also 100K, 500K and 1000K
and the periodicity was 500 Monte Carlo steps. Namely, during each cycle of 500 MC
steps the temperature parameter is gradually reduced until it reaches zero, at which
point the temperature parameter is set again to its initial value for a new 500 step
annealing cycle to begin. The total size of the search space was 3.41x10'^'^ but in all
cases within less than 2000 iterations the algorithm reached the range of stable
sequences between -270 kcal/mol and -300 kcal/mol, according to the scoring function.
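The periodic annealing of the temperature parameter described above can be sketched as a sawtooth schedule combined with an acceptance test. This is an illustrative sketch under stated assumptions: the text says only that the temperature is "gradually reduced", so the linear decrease is an assumption, and the Metropolis-style acceptance rule, the use of the Boltzmann constant in kcal/(mol K) and the names `annealing_temperature` and `accept` are introduced here for illustration and are not taken from the specification.

```python
import math
import random

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K)


def annealing_temperature(step: int, t_max: float, period: int = 500) -> float:
    """Sawtooth schedule: fall linearly (assumed) from t_max to zero, then reset.

    `period` must be greater than 1.
    """
    phase = step % period
    return t_max * (1.0 - phase / (period - 1))


def accept(delta_score: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis-style test (assumed): improvements are always accepted,
    worse scores are accepted with probability exp(-delta / kT)."""
    if delta_score <= 0.0:
        return True
    if temperature <= 0.0:
        return False
    return rng.random() < math.exp(-delta_score / (K_B * temperature))


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 499, 500):
        t = annealing_temperature(step, t_max=1000.0)
        print(step, round(t, 1), accept(1.0, t, rng))
```

At the start of each cycle a higher-scoring mutation has a real chance of being accepted, which lets the search leave a local minimum; towards the end of the cycle only improvements pass, so the search optimizes locally.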

It can be seen from Figs. 1 and 2 that the optimization reaches lower scores
when the periodic annealing temperature profile is used. This profile enables the
program to escape local minima by accepting high-energy sequences that would not
otherwise be accepted but, at the same time, to optimize locally when the temperature is
reduced. Among the three temperatures examined, the search reached the lowest values
for the scoring function when the initial temperature parameter was 1000K and an
annealing profile was used.

Figure 3 shows the energies of the 20 lowest sequences generated by the
algorithm with different simulation lengths and different temperatures, using an
annealing temperature profile with a periodicity of 500 Monte Carlo steps. The results
[...] at 100K after 10^^ iterations and reached different energies each time.

The length of 10^ iterations of the zinc finger protein simulation required one
CPU hour on a single alpha processor workstation, and about 1.5 hours on a Pentium
III PC.

1.3 Results of ββα motif design

A total number of 50 simulations of the program were performed, each one
terminated after 10^ iterations under the same temperature conditions: an annealing
temperature profile with an initial temperature of 1000K. Several different lengths of
Monte Carlo steps periodicity were tested and it was found that the best annealing
periodicity for a simulation of 10^ iterations was 10^ steps.

Each simulation began with a different random seed but with the same 3D
backbone template. The set of 50 simulations was repeated twice, each set with a
different solvent accessibility (SA) assignment for the protein residues. The first set
used the present invention's automated solvent accessibility algorithm and the second
set used Dahiyat and Mayo's (D&M) fitted assignments (see Table 1).

Tables 2A and 2B present the lowest energy sequences obtained in the first and
second sets (A and B respectively), aligned with the second zinc finger module of the
DNA binding protein Zif268 and with the D&M designed sequence, FSD-1. The
coordinates used for the FSD-1 ββα motif score evaluations are the experimental NMR
coordinates (PDB code 1FSD), which were found by D&M. All the energy scores in
Table 2B were calculated according to the present invention's reduced
representation of amino acids and its scoring function. The A and B scores were found
to be lower than both the Zif268 score (without considering the His2Cys2 Zn-binding
interactions, which are not included in the scoring function) and the FSD-1 score. The
energy score of the most stable sequence, A, is -351.9 kcal/mol. This score is lower than
the Zif268 score by 111.3 kcal/mol, which is a significant difference (not taking into
account the Zn interactions). The relative stability of both A and B sequences in
comparison to the FSD-1 sequence may be in part due to the fact that the FSD-1 sequence
was designed with a different scoring function.

Table 2A - The most stable sequence obtained for the Zinc finger

Secondary structure containing Extended sheet (E), Turn (T) or Helix (H);
Solvent accessibility as determined by the invention, categorized as buried (b), exposed (e) or
intermediate (i) (all other positions are exposed);
The sequence as designed by D&M;
Wild-type Zif268 sequence written in one letter code;
Sequence obtained by the present invention using the SA calculation described herein;
Sequence obtained by the present invention using D&M SA fitted assignments.

Table 2B - Energy scores of the sequences presented in Table 2A

Sequence                      Energy (kcal/mol)
FSD-1                         -166.5
Wild type Zif268 (no Zn2+)    -240.6
A                             -351.9
B                             -348.7

The sequence designed by D&M;
The sequence obtained by the present invention using SA calculations as described.



1.4 Analysis of the resulting sequence

Statistical calculations over the 50 sequences obtained in 50 simulations (data
not shown) provide the following observations:

10 i. JOWii UJVU.gJt>l. till VA. Ui.w i.fy\i.u\/yMi\ji^Ly^ cuiixAW c*.wa\jio vifv/i.v^ aji.v>^ wu- at UIW 

intermediate positions, the algorithm selected only non-polar amino acids
at all those locations. This agrees well with the finding that these form a
well-packed buried cluster [Dahiyat B. I. (1997), ibid.].

2. For positions 5, 8, 21, 25 of the original Zn-binding amino acids (two
cysteines (C) and two histidines (H)), the algorithm consistently selected
residues of a well defined solvent accessibility character (even at
"intermediate" positions). Hydrophobic amino acids (Val, Phe, Leu, Ile,
and Met) were selected for positions 5, 21 and 25, which are classified as
either "buried" or "intermediate". In the single exposed position
(position 8), a hydrophilic amino acid was selected (His).

3. Positions 21 and 25 of the optimal sequences were selected to be Phe or
Met (position 21) and Leu (position 25) side chains. In the original
Zif268, these positions were occupied by the zinc binding His residues.
These positions are more than 80 percent buried. Position 5, which is 100
percent buried, was predominantly selected to be Val. The other
boundary positions demonstrate the steric constraints on buried residues
by packing similar side chains to those of the original Zif268 sequence.

4. In the helix region (residues 15-26) the algorithm placed two Leu side
chains and one Gln, which are good helix forming residues, in sequence
B, and one Leu and one Gln in sequence A.

5. In both A and B sequences, position 5 on the exposed sheet surface was
selected by the algorithm to be Val, which is a very good β-sheet forming
residue, and positions 4 and 10 (and 11 only in sequence B) were
selected to be Thr, which is also a good β-sheet forming residue.

6. Alignment of the optimal stable sequence (B) and Zif268 indicates that 4
out of 27 residues (not including residue 27, which remains Gly throughout
the simulation) are identical (15%) and 11 are similar (including the
identical, 40.7%). D&M obtained similar values, with 5 identical residues

7. Alignment of the sequence B and FSD-1 indicates that 5 out of 27
residues are identical between the sequences (18.5%) and 11 are similar
(including the identical, 40.7%).
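The identity and similarity percentages quoted in observations 6 and 7 are simple ratios of residue counts to the 27 designed positions. The sketch below shows the arithmetic, together with a position-by-position identity count for two aligned sequences of equal length. The function names `percent` and `count_identical` are hypothetical, and the short sequences in the usage lines are placeholders, not the sequences of Table 2A.

```python
def percent(count: int, length: int) -> float:
    """Percentage of `count` positions out of `length`, to one decimal place."""
    return round(100.0 * count / length, 1)


def count_identical(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two aligned, equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a == b for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(percent(4, 27), percent(11, 27))      # counts quoted in the text
    print(count_identical("ACDE", "ACKE"))      # placeholder sequences
```

Counting "similar" residues additionally requires a residue similarity grouping, which the text does not specify and which is therefore left out of the sketch.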

1.5 Secondary structure prediction of the designed sequence

Sequences A and B were further examined by secondary structure prediction by
the SSPAL predictor at Sanger Centre [Salamov A. A. and Solovyev V. V. J. Mol.
Biol. 247:11-15 (1995)], which enables prediction of the secondary structure of a protein
according to its primary structure (amino acid sequence). By these programs both A
and B sequences were predicted to have the desired Zinc finger motif. Table 3 presents
the secondary structure of the native protein (Zif268) according to the Protein Data
Bank (PDB), and the A and B secondary structure predictions according to the SSPAL
algorithm at Sanger Centre. A was predicted to have one α-helix (designated H) and
two β-strands (designated E) (the ββα motif) while the predicted secondary structure of
B contained only one α-helix and one β-strand.

Table 3 - Secondary structure of predicted primary structures A and B.

Rows: A, B. Columns: Position, Secondary Structure.



1.6 Molecular dynamics of the re-designed sequences

The reduced representation of the lowest energy designed sequences A and B
was expanded to an all-atom representation, using the molecular mechanics package
CHARMM. The input for this experiment was the backbone coordinates of the native
protein, the newly designed residues and the dihedral angles of each position along the
designed sequence derived from the rotamer with the lowest energy score. The numbers
of atoms of A and B after expansion to all atoms were 459 and 446, respectively.
Energy minimization was performed for the side chains of A and B as well as for the
Zif268 side chains using the CHARMM forcefield, the SHAKE algorithm [van Gunsteren W.F. &
Berendsen H.J.C. Mol. Phys. 34:1311 (1977)], a dielectric constant of ε=1 and a 12Å
energy cutoff. The minimization included 200 steps of SD (Steepest Descent) and then
an additional 500 steps of ABNR (Adopted Basis Newton Raphson). After minimization,
each of the three structures was embedded in an ISA water sphere which included
~1870 water molecules of type TIP3P [Jorgensen W.L. et al. J. Chem. Phys. 79:926-935
(1983)]. Each of the water-protein systems of A and B was simulated for 500ps at
300K (with a 16Å energy cutoff) and a sample of 2000 conformations was collected
from the resulting molecular dynamics trajectory. It was found that the secondary
structure of the proteins was maintained during the molecular dynamics simulations.

The root-mean-square (rms) difference between the protein structure of the
designed sequences before the molecular dynamics simulation and their protein
structure after the molecular dynamics simulation was:

Sequence    Backbone    Side Chains    Total
A           1.84Å       3.05Å          2.69Å
B           IMA         2.7JA          2.43Å

These results clearly indicate that the overall fold of the designed proteins
remained ββα with some relaxation of backbone and side chains.
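The rms difference reported above is the square root of the mean squared displacement of corresponding atoms between two structures. A minimal sketch follows, assuming that the two coordinate sets list the same atoms in the same order and are already expressed in a common frame (no superposition step is performed here); the function name `rmsd` and the coordinates in the usage lines are illustrative only.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(before: Sequence[Point], after: Sequence[Point]) -> float:
    """Root-mean-square deviation between two matched sets of atom coordinates."""
    if len(before) != len(after) or not before:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(before, after):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(before))


if __name__ == "__main__":
    start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    end = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
    print(rmsd(start, end))
```

Restricting the input to backbone atoms, side chain atoms or all atoms gives the three columns of the table.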

The molecular mechanics average energies of the two protein sequences during
the simulations were:

A: -437.4 kcal/mol

B: -196.4 kcal/mol

The above energy results indicate that the method and system of the present
invention is indeed useful for the prediction of primary sequences that stabilize the
ββα motif even in the absence of the ion.

The differences between the energy scores calculated for sequences A and B by
the scoring function of the present invention and by the CHARMM force-field (after
dynamics) were only 20% and 44%, respectively. These results indicate that the scoring
function of the present invention provides a satisfactory evaluation of the potential
energy of the designed sequences. Furthermore, in both the CHARMM force-field and the
scoring function of the present invention, sequence A yielded a significantly more
stable structure than sequence B.
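The quoted percentages can be checked from the values given in Table 2B and in the molecular dynamics results. The text does not state how the percentage was normalized; the sketch below assumes the difference is divided by the larger of the two magnitudes, which reproduces the quoted figures after rounding. The function name `relative_difference` is hypothetical.

```python
def relative_difference(score_a: float, score_b: float) -> float:
    """Percent difference relative to the larger-magnitude value (assumed convention)."""
    larger = max(abs(score_a), abs(score_b))
    return 100.0 * abs(score_a - score_b) / larger


if __name__ == "__main__":
    # Scoring-function value vs. average CHARMM energy, both in kcal/mol.
    print(round(relative_difference(-351.9, -437.4), 1))  # sequence A
    print(round(relative_difference(-348.7, -196.4), 1))  # sequence B
```

Under this assumption the results are about 19.5% for sequence A and 43.7% for sequence B, in line with the 20% and 44% stated above.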

For comparison, the zinc bound wild-type Zif268 was also simulated using a
system of solvated Zif268 wherein only the water sphere was allowed to move, while
the protein coordinates were kept fixed. The system was simulated for 100ps at 300K
and a sample of 400 conformations was collected from the resulting molecular
dynamics trajectory. The average CHARMM energy for Zif268 was -1657.6 kcal/mol
(the Zn interaction with its four anchor residues contributing -298.719 kcal/mol). The
reason for the great difference between the Zif268 energy score by the function
of the present invention and its CHARMM energy is that the scoring function
calculation of the native protein does not consider the contribution of the
His2Cys2 Zn-binding interactions.

Figures 4A-4C show the 3D structure of Zif268 as compared to that of the
designed proteins A and B after minimization of their side chains, focusing on the core,
which includes hydrophobic side chains in A and B, instead of the zinc ion chelated by
two cysteines and two histidines in the native protein. The same structures are
presented in Fig. 5 but with the core side chains displayed by spheres sized to the van
der Waals radii of the atoms, which indicate the good packing of the core in the
designed sequences.

Example 2 - Gβ1 as a target fold

The core of the β1 domain of Streptococcal protein G (Gβ1), a 56 residue protein,
was examined. Gβ1 is derived from a larger multi-domain cell surface protein that
functions with high affinity binding to the Fc region of IgG. It comprises six β-strands
and one α-helix. An extremely hyperthermophilic variant of the β1 domain of
Streptococcal protein G was already reported by Mayo and collaborators.

2.1 Solvent accessibility for Gβ1 domain

Solvent accessibility was evaluated as described in Example 1. A comparison of
the results obtained by the method of the present invention and that of Malakauskas
and Mayo (M&M, using mainly the Connolly algorithm referred to hereinbefore) is
presented in Figure 6. As can be seen from this Figure, 22 positions were found to be
exposed in both cases, 11 were found to be buried and 10 at an intermediate level of
solvent accessibility. The 13 remaining residues were classified differently in the two
methods. In 10 of these cases the discrepancy was between the subtle definition of a
site as being "buried" or "intermediate". The solvent accessibility of the eighth
position (position 8) selected for optimization was identical in the two classification
schemes.

2.2 Results of Gβ1 core design

The energy score profile for Gβ1 was obtained in the same manner as described
for Zif268. Further, the energy score profile obtained for Gβ1 was similar to that
obtained for Zif268 (Figs. 1 and 2), using the same temperature conditions. However,
since only 8 out of 56 residues (14.3%) were mutated, the initial energy was already
negative while the final energies obtained were approximately -770 and -790 kcal/mol
when an annealing temperature profile was used and the maximal value for the
temperature parameter was 1000K.

Two sets of 50 simulations were conducted. In the first set of simulations, the
non-mutated positions were kept in their native rotameric conformation while in the
second set of simulations, the rotameric states of the side chains of the non-mutating
residues were allowed to change, thus providing a larger number of possible mutations
and rotamer combinations. The total size of the search space in the first set was
1.06x10^^ and in the second set was 2.52x10^^. Each simulation in the first set was
terminated after 10^ iterations with a maximal temperature of 1000K and an annealing
periodicity of 100 MC steps. Each simulation in the second set was terminated after
10 iterations, with a maximal temperature of 1000K and an annealing periodicity of
1000 MC steps.

Table 4 presents the mutated residues in the lowest energy sequences among the
50 simulations conducted for each set (C and D respectively).

Table 4 - Mutated residues

2.3 Analysis of the resulting sequences

Statistical calculations over the 50 sequences obtained in 50 simulations in the
fixed and non-fixed conformations, the latter having rotameric freedom (sets C and D,
data not shown) provided the following observations:

1. In 72%-76% of the sequences buried positions 7 and 39 were found to be
Leu and Val, respectively, as in the native protein.

2. In most sequences Thr25 was mutated to Leu, which has a better helix
propensity. Examination of sequences C and D by the SSPAL predictor at
Sanger Centre (as described hereinbefore) and by the PHD predictor at
EMBL [Rost B. and Sander C. J. Mol. Biol. 232:584-599 (1993)] showed
that mutations T25L and V29L maintained the secondary structure of the
α-helix.

3. More than 80% of the sequences consisted of the mutations T16L, T18L
and T18F, which are located in a β-strand. The Leu and Phe apparently
can substitute for the wild-type Thr without compromising the β-sheet
propensity.

4. The buried position 3 changed from Tyr mainly to Leu, which is more
hydrophobic and is predicted to improve side chain packing in the
interior of the protein.

5. Four percent of the 50 sequences in the first simulation set, where the
coordinates of the non-mutated residues were kept fixed, had Trp in
position 43, the same as in the native protein. In the second simulation
set, this position was predominantly assigned a Phe residue.

2.4 Minimization of the redesigned sequences in comparison to native Gβ1

The reduced representation of the designed sequence C, which was the sequence
with the lowest energy score in the first set of simulations, where the conformation of
the 48 non-mutated residues was kept fixed, was expanded to its all-atom representation
using CHARMM [Brooks et al. (1983) ibid.], in the same manner as described
hereinbefore. In general, the information provided as input was the backbone
coordinates of the native protein, the new residues and the rotamer dihedral angles at
each position along the chain. The number of atoms in sequence C was 865. Energy
minimization was performed for the side chains of this sequence as well as for Gβ1's
side chains using the CHARMM force-field [Mackerell A.D. et al. (1998) ibid.], a
dielectric constant ε=1 and a 16Å energy cutoff. The minimization included 800 steps
of SD and then an additional 1100 steps of ABNR. The average energies of the sequences
after minimization were:

Gβ1: -804.2 kcal/mol

C: -822.7 kcal/mol

The above results suggest that the mutations obtained by the methodology
disclosed herein are tolerable, which may lead to the design of a more stable protein.
The difference between the energies of the native sequence Gβ1 and sequence
C, based on the present invention's scoring function and on CHARMM's force-field
(after dynamics), was 7% and 16% respectively, which strengthens the conclusion that
the method and system of the present invention provide a reliable tool for designing de
novo proteins.

The above statement is further strengthened in light of Figure 7, which shows a
comparison of the 3D structure of Gβ1 and sequence C.

These results show that re-designing five boundary residues and three buried
positions in the core of the β1 domain of Streptococcal protein G was tolerable and that a
stable, fully folded de novo protein may be obtained.

While the foregoing description disclosed in detail only a few specific
embodiments of the invention, it will be understood by those skilled in the art that the
method and system of the invention is not limited to the design of these proteins.
Further, it should be understood that other variations of the method and system of the
invention may be possible without departing from the scope and spirit of the invention
as herein disclosed.






CLAIMS 

1. A computer-implemented method for predicting at least one amino acid sequence
compatible with a specified three-dimensional (3D) structure of a protein or peptide,
which method comprises the steps of:

(a) providing a coordinate set representing the backbone of said 3D 
structure; 

(b) constructing a reduced virtual representation for the 3D structure
provided in step (a);

(c) determining for each position along the virtual structure representation
provided in step (b) its solvent accessibility;

(d) constructing an initial amino acid sequence by randomly assigning for
each position along the structure an amino acid residue selected randomly from
a predefined group of amino acids having a solvent accessibility compatible
with the solvent accessibility of said position;

(e) randomly selecting one or more positions along the sequence provided in
step (d) and applying on each position a Monte-Carlo simulation in sequence
space and rotamer space, said simulation comprising one or more scoring
function calculating steps which include:

i) randomly selecting one or more amino acid residues of the same 
solvent accessibility as that defined for said position to obtain a 
mutation; 

ii) calculating an energy scoring function for each possible rotamer 
of each amino acid residue provided in step (i) based on their said 
reduced virtual representation; 

iii) selecting the lowest scoring rotamer or, when more than one
amino acid is manipulated simultaneously, selecting the lowest
scoring rotamer combination;

iv) determining whether to accept or reject the mutation with the
rotamer or rotamer combination selected in step (iii); and

v) assigning the amino acid residue or residues and their respective
selected rotamer or rotamer combinations selected in step (iii) to said
position/s and moving to another position along the sequence;

said simulation steps are repeated until for each position along said sequence,
the residue and the residue's rotamer with the lowest energy score is selected, to
obtain a virtually represented amino acid sequence with the lowest total energy
score;

(f) expanding the reduced representation of the virtually represented amino
acid sequence obtained in step (e) to its corresponding all-atom sequence
representation, thereby obtaining an amino acid sequence compatible with the
predefined 3D structure;

(g) optionally, creating a computer output of the expanded all-atom
representation of the primary structure/s obtained in step (f).

2. The method as claimed in claim 1, wherein the 3D structure provided in step (a) is
that of a native peptide, or protein, or of a designed protein.

3. The method as claimed in claim wherein said coordinate set is provided in a 
computer readable form. 

4. The method as claimed in claim 1, wherein said amino acid sequence may comprise
naturally occurring amino acid residues, synthetic amino acid residues, or variations of
said naturally occurring or synthetic amino acid residues.

5. The method as claimed in claim 1, wherein for each position along the 3D structure
its solvent accessibility is determined according to the extent of exposure of said
position to the solvent surrounding it, said position being either a buried, exposed or
intermediate position.

6. The method as claimed in claim 5, wherein said solvent is substantially water.

7. The method as claimed in claim 6, wherein said buried positions are occupied by
hydrophobic amino acid residues.

8. The method as claimed in claim 7, wherein said hydrophobic amino acid residues
are selected from the group consisting of Ala, Tyr, Trp, Val, Leu, Ile, Phe, Met, Cys,
Pro, Gly.

9. The method as claimed in claim 5, wherein said exposed positions are occupied by
hydrophilic amino acid residues.

10. The method as claimed in claim 9, wherein said hydrophilic amino acid residues
are selected from the group consisting of Lys, Arg, His, Glu, Asp, Gln, Asn, Ser, Thr.

11. The method as claimed in claim 5, wherein said intermediate positions are
occupied by either hydrophilic or hydrophobic amino acid residues.

12. The method as claimed in claim 11, wherein said intermediate positions are
occupied by amino acid residues selected from the group consisting of Pro, Lys, Arg,
His, Glu, Asp, Gln, Asn, Ser, Thr, Gly, Ala, Tyr, Trp, Val, Leu, Ile, Phe, Met, Cys.

13. The method as claimed in claim 1, wherein said Monte Carlo simulation is
applied simultaneously on up to three random positions in said sequence.

14. The method as claimed in claim 1, wherein said Monte Carlo step is conducted
either at a fixed temperature or at a varying annealing temperature.

15. The method as claimed in claim 1, wherein a de novo amino acid sequence is 
generated. 

16. The method as claimed in claim 1, wherein said amino acid sequence folds
under physiological conditions into a biologically functional 3D conformation
substantially identical to said predefined 3D structure or to a portion thereof.

17. The method as claimed in claim 15 or 16, wherein said de novo amino acid
sequence stabilizes said 3D structure, as compared to the native amino acid sequence.

18. An amino acid sequence which folds under physiological conditions into a
specified 3D structure, said amino acid sequence being obtained by the method of claim 1.

19. An amino acid sequence according to claim 18, which is biologically functional.

20. A nucleic acid sequence encoding the amino acid sequence of claim 19 or 20.

21. A computer-based system for predicting an amino acid sequence compatible
with a predefined 3D structure according to the method of claim 1, said system
comprising:

(a) input apparatus for specifying said 3D structure; 

(b) a first memory for storing the specified 3D structure; 

(c) a second memory having stored thereon an application program which,
when running, provides at least one amino acid sequence compatible with the
specified 3D structure;

(d) a third memory for storing the at least one amino acid sequence obtained;

(e) a processor coupled to said input means, and to said first, second and
third memories for representation of said amino acid sequence; and

(f) optionally, a display unit coupled to said processing means for displaying
the amino acid sequence.
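The sequence-space search of claim 1, steps (d) and (e), can be sketched as follows, using the residue groups of claims 8, 10 and 12. This is a simplified illustration under stated assumptions and not the claimed implementation: `toy_score` is a hypothetical stand-in for the energy scoring function, the rotamer search of steps (ii) and (iii) is omitted, one position is mutated per step, the acceptance rule is an assumed Metropolis test, and the sawtooth temperature schedule is assumed to fall linearly within each period. All function and variable names are introduced here.

```python
import math
import random

# Residue groups by solvent accessibility class, following claims 8, 10 and 12.
HYDROPHOBIC = "AYWVLIFMCPG"
HYDROPHILIC = "KRHEDQNST"
GROUPS = {"b": HYDROPHOBIC, "e": HYDROPHILIC, "i": HYDROPHOBIC + HYDROPHILIC}

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K)


def toy_score(sequence, accessibility):
    """Hypothetical scoring stand-in: rewards Leu at buried and Lys at exposed
    positions. It is NOT the energy scoring function of the invention."""
    score = 0.0
    for aa, sa in zip(sequence, accessibility):
        if (sa == "b" and aa == "L") or (sa == "e" and aa == "K"):
            score -= 1.0
    return score


def design(accessibility, steps, t_max, period, seed):
    """Return the lowest-scoring sequence found and its score."""
    rng = random.Random(seed)
    # Step (d): random initial sequence compatible with each position's class.
    seq = [rng.choice(GROUPS[sa]) for sa in accessibility]
    score = toy_score(seq, accessibility)
    best_seq, best_score = list(seq), score
    for step in range(steps):
        # Assumed sawtooth annealing: linear decrease, reset every `period` steps.
        temperature = t_max * (1.0 - (step % period) / period)
        pos = rng.randrange(len(seq))                      # step (e): pick a position
        old = seq[pos]
        seq[pos] = rng.choice(GROUPS[accessibility[pos]])  # (i): same-class mutation
        new_score = toy_score(seq, accessibility)          # (ii)-(iii): rotamers omitted
        delta = new_score - score
        kt = K_B * temperature
        # (iv): assumed Metropolis acceptance test.
        if delta <= 0.0 or (kt > 0.0 and rng.random() < math.exp(-delta / kt)):
            score = new_score                              # (v): keep the mutation
            if score < best_score:
                best_seq, best_score = list(seq), score
        else:
            seq[pos] = old                                 # reject: restore residue
    return "".join(best_seq), best_score


if __name__ == "__main__":
    print(design("bbeeii", steps=2000, t_max=1000.0, period=500, seed=1))
```

Running the search several times with different seeds and keeping the lowest-scoring result mirrors the sets of 50 simulations described in the Examples.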



ABSTRACT

A method for predicting an amino acid sequence compatible with a
three-dimensional (3D) structure of a protein. A reduced virtual representation
of the 3D structure is constructed, and, for each position along the
representation, its solvent accessibility is determined. For each position along
the structure, an amino acid residue is randomly selected from a predefined
group of amino acids having a solvent accessibility compatible with the solvent
accessibility of the position. A Monte-Carlo simulation is performed on this
devised protein in which an amino acid at a particular position is sequentially
replaced with other amino acids having the same solvent accessibility, and an
energy score is calculated for each rotamer. The lowest scoring rotamer for this
position is then selected. The Monte-Carlo simulation is repeated for each
position along the sequence, to obtain an amino acid sequence with the lowest
total energy score.